Skip to content

feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190)#192

Merged
AdaWorldAPI merged 2 commits into
masterfrom
claude/continue-ndarray-x0Oaw
May 21, 2026
Merged

feat(simd_caps): CPUID 7,1 + new x86 caps fields + AMX OS-gate in cpu_ops (salvage from #190)#192
AdaWorldAPI merged 2 commits into
masterfrom
claude/continue-ndarray-x0Oaw

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Salvages the detection-only subset of the closed PR #190 — three real gaps in the substrate runtime dispatch — without inheriting any of #190's consumer-facing additions. No SimdProfile enum, no cpu-* cargo features, no public dispatch-identity API; crate::simd::* remains the sole consumer entry point.

What lands

Change File Purpose
CPUID leaf 7,1 read for AMX-FP16 src/hpc/simd_caps.rs Granite Rapids vs Sapphire Rapids discrimination; bit 21 of CPUID.07H.1H:EAX
New SimdCaps fields: avx512fp16, avx512vp2intersect, amx_fp16 same Additive (default false on non-x86); ready for future kernels that route on FP16 / VP2INTERSECT
Convenience methods has_avx512_fp16(), has_amx_fp16() same Mirror the existing has_amx, has_avx512_bf16 etc. patterns. has_amx_fp16() defense-in-depths the amx_tile bit.
AMX OS-state gate in cpu_ops() selection src/simd_runtime/cpu_ops.rs AND-gates the AMX-INT8 tier on simd_amx::amx_available() (the existing 4-step CPUID + OSXSAVE + XCR0 + arch_prctl check). Closes the SIGILL hole when a hypervisor masks XCR0 bits 17/18 or the OS hasn't honoured arch_prctl(XCOMP_PERM, 18) on Linux 5.19+.

What's deliberately NOT here (rejected from #190)

Surface Why rejected
SimdProfile 14-variant enum + simd_profile() accessor Exposes dispatch identity to consumer code; invites match profile { ... } arms — the polyfill-defeat pattern.
cpu-* cargo features (13 mutually-exclusive flags) Build-time silicon pinning; consumer codes against pinned silicon, same broken invariant at earlier binding time.
pub use ... SimdProfile from src/simd.rs Places the dispatch discriminant on the consumer surface — worst-of-both-worlds.
simd_profile_probe example binary Diagnostic-only, rebuilds the SimdProfile surface this PR doesn't bring.

Why now

The AMX OS-gate fix is load-bearing on Sapphire Rapids hosts where the hypervisor masks the tile XSAVE state — currently cpu_ops() routes to CPU_OPS_AMX_INT8, calls SIGILL on the first AMX instruction. After this PR it demotes to CPU_OPS_AVX512_VNNI cleanly without the consumer noticing. The bit-exact polyfill contract still holds (both tiers produce identical results — that's the W1a guarantee); the choice between them was just runtime-broken on AMX-but-OS-blocked hosts.

The new SimdCaps fields don't feed any selection yet; they're laying the runway. When an AMX-FP16 kernel lands (probably tied to the BF16/FP16 work hinted at in the matrix doc § J Phase 3b), the new tier would AND-gate on caps.amx_fp16 && simd_amx::amx_available() between the AMX-INT8 and AVX-512-VNNI arms.

Tests

  • 4 new simd_caps::tests: cpuid_extended_bits_smoke, has_amx_fp16_requires_amx_tile, x86_extended_bits_are_false_on_non_x86, plus extended determinism coverage of the new fields.
  • All 6 existing cpu_ops::tests still pass; the AMX OS-gate change passes through transparently on hosts where amx_available() agrees with CPUID (the typical case).
  • cargo build --no-default-features clean (new fields zero-init on non-x86 / scalar stubs).
  • cargo fmt --all --check clean.
  • cargo clippy --features runtime-dispatch --lib -- -D warnings clean.

Relation to closed #190

The closed PR's simd_caps.rs additions were correct and necessary; only the architectural overreach (SimdProfile / cpu-* / public dispatch-identity API) was the problem. This PR is the minimal extraction — ~130 lines, one commit, zero new public types.


Generated by Claude Code

claude added 2 commits May 21, 2026 12:05
… AMX OS-gate in cpu_ops

Salvages the detection-only subset of closed PR #190 — three real
gaps in the substrate runtime dispatch without inheriting any of
PR #190's consumer-facing additions (no SimdProfile enum, no
public dispatch-identity API, no cpu-* features).

What lands here:

1) CPUID leaf 7,1 read for AMX-FP16 (CPUID.07H.1H:EAX bit 21).
   Lives on a different subleaf than the existing AMX bits;
   GraniteRapids is the only silicon advertising it today.
   Guarded by leaf 7,0 EAX >= 1 so older CPUs that don't expose
   subleaf 1 stay correct.

2) Three new SimdCaps fields (additive, all default false on
   non-x86):
   - avx512fp16  — CPUID.07H.0H:EDX bit 23 — `__m512h` math.
                   Discriminates SPR-class from CascadeLake/
                   IceLakeSp/SkylakeX for any future FP16 kernel.
   - avx512vp2intersect — CPUID.07H.0H:EDX bit 8 — TigerLake
                   mobile only; absent from Ice Lake-SP and every
                   later server part. Exposed for completeness.
   - amx_fp16     — CPUID.07H.1H:EAX bit 21 — Granite Rapids.

   Plus convenience methods has_avx512_fp16() and has_amx_fp16()
   (the latter defense-in-depths the amx_tile bit).

3) AMX OS-state gate in cpu_ops() selection. The CPU-reports-AMX
   path now AND-gates on `simd_amx::amx_available()` which runs
   the full four-step check (CPUID + OSXSAVE + XCR0[17,18] +
   arch_prctl(XCOMP_PERM, 18) on Linux 5.19+). This closes the
   SIGILL hole when a hypervisor masks XCR0 or the OS hasn't
   honoured the prctl: previously cpu_ops() would route to
   CPU_OPS_AMX_INT8 and AMX instructions would SIGILL despite
   the CPUID bit. Now it demotes to CPU_OPS_AVX512_VNNI cleanly.

What's deliberately NOT here (rejected from PR #190):

- No `SimdProfile` enum — would expose dispatch identity to
  consumer code and invite `match profile { ... }` arms that
  defeat the polyfill contract.
- No `cpu-*` cargo features — build-time silicon pinning that
  defeats polyfill at an earlier binding time.
- No `simd_profile_probe` example — diagnostic-only, rebuilds
  the SimdProfile surface this PR doesn't bring.
- No public dispatch-identity API at any layer. The new bits
  are internal substrate detection; consumers continue to use
  `crate::simd::*` polyfilled types and `crate::simd_runtime::*`
  per-op trampolines.

The new fields slot into existing `cpu_ops()` selection by
extension (e.g. a future AMX-FP16 tier would AND-gate on
`caps.amx_fp16 && simd_amx::amx_available()` between the
AMX-INT8 and AVX-512-VNNI arms). No selection logic uses them
yet — they're laying the runway, not consuming it.

Tests:
- 4 new simd_caps tests: cpuid_extended_bits_smoke,
  has_amx_fp16_requires_amx_tile,
  x86_extended_bits_are_false_on_non_x86, plus extended
  determinism coverage.
- All 6 existing cpu_ops tests still pass; the AMX OS-gate
  change passes through transparently on hosts where
  amx_available() agrees with CPUID (the typical case).
- fmt + clippy clean on `--features runtime-dispatch`.
Same canonical-fmt collapse as the prior pillar-branch hotfixes.
No behavioral change.
@AdaWorldAPI AdaWorldAPI merged commit adeba8a into master May 21, 2026
17 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
…t re-export

First step in the substrate-graduation thread documented in #192's
wrap-up: lift the substrate-tier modules out of `hpc/` (which was
the rustynum migration staging area) to crate root, where they sit
in scope of the W1a polyfill contract and no longer carry the
spurious `std`-gate inherited from `hpc/`.

`simd_caps` is the smallest and cleanest first move:
  * No internal `hpc/` dependencies (only `use std::sync::LazyLock`).
  * 8 internal callers; back-compat re-export keeps them working.
  * Pure CPU-detection metadata; the most polyfill-adjacent module
    in the entire `hpc/` set.

Changes:

1. `src/hpc/simd_caps.rs` → `src/simd_caps.rs` (file move).
2. `src/lib.rs` adds `#[cfg(feature = "std")] pub mod simd_caps;`.
   The std-gate is retained for now (uses `std::sync::LazyLock`);
   lifting it to `core::sync::LazyLock` is a separate follow-up.
3. `src/hpc/mod.rs` replaces `pub mod simd_caps;` with
   `pub use crate::simd_caps;` — keeps `crate::hpc::simd_caps::*`
   resolving for cross-repo consumers (lance-graph, WoA, MedCare,
   q2 may have `use ndarray::hpc::simd_caps::*` imports that this
   preserves untouched).

No public-API breakage; the test suite picks up the new path
(test names now `simd_caps::tests::*` rather than `hpc::simd_caps::*`),
all 10 tests pass under both default and `runtime-dispatch` configs.
The 8 internal callers (`crate::simd_avx512`, `crate::hpc::p64_bridge`,
`crate::simd_runtime::{cpu_ops, add_mul, vnni_dot}`) continue using
`crate::hpc::simd_caps::*` via the re-export and work unmodified.

Next graduation candidates (deferred to follow-up PRs):
  - `fingerprint` (bitwise substrate; raw `u64` polyfill audit)
  - `dn_tree` (bitwise substrate; same audit)
  - `ogit_bridge` (pure logic, no SIMD primitives)
  - `splat3d` (already uses `crate::simd::*` polyfilled types)

Each move follows the same pattern: relocate file, drop std-gate
inheritance where unneeded, keep back-compat re-export. Cognitive
layer (pillar, plane, seal, merkle_tree, deepnsm, …) stays inside
`hpc/` and keeps its legitimate std-gate.
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Continues the substrate-graduation thread documented in #192's wrap-up
and extended in #193's simd_caps lift. Five more modules move from
`crate::hpc::*` (the rustynum migration staging area) to crate root
where they sit alongside `simd.rs`, `simd_runtime/`, `simd_caps`, and
the W1a polyfill surface they're supposed to compose with.

| Module | Reason |
|---|---|
| `bitwise`        | Pure SIMD primitives (popcount, hamming over byte slices); already uses `crate::simd::U64x8` polyfill internally; already re-exported via `simd.rs:512`. |
| `heel_f64x8`     | All-F64x8 polyfill consumer (dot, cosine, sum-sq, weighted-hamming); already re-exported via `simd.rs:563`. |
| `distance`       | Spatial 3D + slice-shape L1/L2/L∞ (PR-X10 A6); the linalg/mod.rs hard-boundary comment now points here at root. |
| `byte_scan`      | Pure SIMD utility (needle search, delimiter find). |
| `spatial_hash`   | Pure SIMD utility (bucketing, candidate gather). |

# Why these five, why now

All five satisfied the low-hanging-fruit criteria from #193's wrap-up
discussion:
  1. No internal `hpc/` dependencies (only `super::simd_caps` which
     still resolves correctly because `simd_caps` is itself at crate
     root post-#192).
  2. Already polyfill-clean — no raw-intrinsic refactor needed before
     the move.
  3. Already partially exposed via `crate::simd::*` re-exports.

The next graduation tier (`fingerprint`, `dn_tree`, `ogit_bridge`,
`splat3d`) needs a polyfill audit before it can move, and
`fingerprint` in particular is gated on the W1a-#5 POPCOUNT-U64
primitive landing (so its bit ops can route through `U64xN.popcnt()`
instead of raw `u64.count_ones()`).

# Back-compat preserved end-to-end

Every cross-repo consumer using `ndarray::hpc::{bitwise, heel_f64x8,
distance, byte_scan, spatial_hash}::*` continues to compile
unmodified. The `src/hpc/mod.rs` declarations change from
`pub mod X;` to `pub use crate::X;` — Rust re-exports modules just
like other items, so `crate::hpc::X::*` resolves through to the same
items as `crate::X::*`. Internal `super::simd_caps::simd_caps()`
calls inside the moved files continue to work because `super::` at
crate root resolves to `crate::*` which has `simd_caps` (graduated
in #192).

# Changes

- `git mv` five files from `src/hpc/` to `src/`.
- `src/lib.rs` gains five `#[cfg(feature = "std")] pub mod X;`
  declarations next to the existing `simd_caps` block, each with a
  one-liner docstring naming the graduation source and the
  substrate-tier reason for the move.
- `src/hpc/mod.rs` replaces five `pub mod X;` with `pub use crate::X;`
  (back-compat re-exports).
- `src/hpc/linalg/mod.rs` updates the hard-boundary comment from
  "No distance metrics — those live in `crate::hpc::distance`" to
  point at `crate::distance` (the new canonical path) with a
  parenthetical noting the back-compat re-export.
- The `bitwise.rs` declaration in `src/hpc/mod.rs` is now a comment
  instead of being interleaved with `pub mod hdc`/`pub mod projection`
  to make the graduation status visible at a glance.

# Verification

- `cargo build -p ndarray --lib` — clean
- `cargo build -p ndarray --lib --no-default-features` — clean
  (the new `#[cfg(feature = "std")]` gates match the existing
  `simd_caps` pattern; nostd targets see no change)
- `cargo test -p ndarray --lib bitwise:: distance:: heel_f64x8::
  byte_scan:: spatial_hash::` — all 119 tests on the five graduated
  modules pass at the new path (test names now `bitwise::tests::*`
  rather than `hpc::bitwise::tests::*`)
- `cargo test -p ndarray --lib --features "pillar,ogit_bridge,
  runtime-dispatch" hpc::` — 2167 passed, 0 failed, 28 ignored
- `cargo fmt --all --check` — clean
- `cargo clippy --features "pillar,ogit_bridge,runtime-dispatch"
  --lib -- -D warnings` — clean

# Next graduation candidates (deferred)

- `hpc::fingerprint` — needs W1a-#5 POPCOUNT-U64 to land first so
  bit ops can route through `U64xN.popcnt()` instead of raw
  `u64.count_ones()`. Cognitive-shader-foundation explicitly names
  `Fingerprint<N>` as a MUST-be-in-`ndarray::simd::*` type.
- `hpc::dn_tree` (bitwise core) — same polyfill-audit dependency.
  The cognitive DNTree/DNConfig/TraversalHit state stays in
  `hpc/` after the split.
- `hpc::ogit_bridge` — pure logic, no SIMD, can move once the
  fingerprint + dn_tree audits are out of the way (avoids three
  partial graduations in flight at once).
- `hpc::splat3d` — already mostly polyfill-clean; pure path
  rewrite. Defer because it's a larger consumer surface than
  the five in this PR.
AdaWorldAPI pushed a commit that referenced this pull request May 21, 2026
Continues the substrate-graduation thread from #192 (simd_caps),
#193 (clippy/doc cleanup), and #194 (bitwise/heel_f64x8/distance/
byte_scan/spatial_hash). Same low-hanging-fruit criteria — no
internal hpc/ deps, polyfill-clean, single-line back-compat shim
keeps every existing import resolving.

| Module          | Reason                                                         |
|---|---|
| `aabb`          | SIMD AABB intersection/expansion/distance; only deps are       |
|                 | `crate::simd::F32x16` + `super::simd_caps` (graduated #192).   |
| `nibble`        | 4-bit packed nibble batch ops; only dep is `crate::simd::U8x64`.|
| `palette_codec` | Variable-width palette index codec (1-8 bit packing); zero deps.|
| `property_mask` | AVX-512 VPTERNLOGD bitset queries on block state bits;         |
|                 | only dep is `crate::simd::U64x8`.                              |

# Why these four, why now

All four satisfy the criteria from #194's wrap-up:
  1. No internal `hpc/` dependencies — only `crate::simd::*`
     (polyfill surface) and `super::simd_caps` (which is itself
     at crate root post-#192).
  2. Polyfill-clean — no raw-intrinsic refactor required.
  3. Single in-tree downstream caller (`hpc::framebuffer` uses
     `palette_codec`) → the `pub use crate::palette_codec;`
     back-compat shim keeps that resolution working zero-touch.

# Mechanical changes

- `git mv src/hpc/{aabb,nibble,palette_codec,property_mask}.rs src/`
- `src/lib.rs`: added four `pub mod` declarations under
  `#[cfg(feature = "std")]`, each with a `# Example` rustdoc block
  per CLAUDE.md "all public APIs need doc comments with examples".
- `src/hpc/mod.rs`: replaced the four `pub mod` declarations with
  `pub use crate::{aabb, nibble, palette_codec, property_mask};`
  back-compat re-exports. `crate::hpc::aabb::*` and friends keep
  resolving for every existing call site, identical to how
  `crate::hpc::bitwise::*` works post-#194.

# Clippy / lint cleanup

17 clippy errors surfaced under `-D warnings` once the modules
left the `hpc/mod.rs` `#![allow(clippy::all, ...)]` umbrella.
Fixed each at the canonical Rust idiom (the #194 cleanup pattern,
417131b), no umbrella re-application:

- **manual_div_ceil (6 sites)** — `(n + d - 1) / d` → `n.div_ceil(d)`
  in `nibble.rs` (x2), `palette_codec.rs` (x3), `property_mask.rs`.
- **needless_range_loop (10 sites)** — `for i in start..vec.len()`
  rewrites to `for x in &vec[start..]` (when index unused) or
  `for (i, &x) in iter().enumerate().skip(start)` (when index used).
  Sites: `aabb.rs` x4, `nibble.rs` x3, `palette_codec.rs` x1,
  `property_mask.rs` x2.
- **missing_docs (4 sites)** — added field doc comments on
  `pub struct Aabb { min, max }` and `pub struct Ray { origin,
  inv_dir }`. Previously masked by the `hpc/mod.rs` umbrella's
  `#![allow(missing_docs)]`.

# Doctest correction

Initial `# Example` in `src/lib.rs` for `palette_codec` asserted
`bits_for_palette_size(1) == 1` per the module's own docstring
table, but the impl returns 0 for `palette_size <= 1` (trivial-
palette special case). Changed assertion to use `bits_for_palette_
size(2) == 1` — exercises the same code path with input the impl
actually handles per spec.

# Verification

```
cargo check --lib                                          green
cargo clippy --lib -- -D warnings                          green
cargo clippy --lib --features rayon -- -D warnings         green
cargo clippy --features approx,serde,rayon -- -D warnings  green
cargo test --doc (15 graduated-module doctests)            pass
cargo test --lib (104 unit tests across 4 modules)         pass
```

# What's next

`hpc/` inventory: ~55 → ~51 modules at the staging path. Next-batch
candidates per the same criteria need a deps audit before move:
`framebuffer` (uses `palette_codec` shim, otherwise crate-root),
`ocr_simd`/`ocr_felt`, `audio`. Filed in AGENT_LOG entry for the
follow-up pass.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants